feat: Add repo indexing job #112136

Merged
shruthilayaj merged 14 commits into master from shruthi/feat/add-repo-indexing-job
Apr 7, 2026

@shruthilayaj shruthilayaj commented Apr 2, 2026

Schedule a repo indexing job for the context engine. It is behind a new
"experimental" feature flag so we can see how this context works
out on Sentry Seer Explorer runs. The index job only runs
on Sundays because we don't want to eat into GitHub API quotas and
interfere with code review and autofix.

Depends on: https://github.com/getsentry/seer/pull/5594
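
The description does not show how the Sunday-only restriction is enforced. As a rough sketch of one way such a gate could look, assuming a plain weekday check (the helper name `is_index_day` is hypothetical and not taken from this PR):

```python
from datetime import datetime, timezone

SUNDAY = 6  # datetime.weekday(): Monday is 0, Sunday is 6


def is_index_day(now: datetime) -> bool:
    """Return True when `now` falls on a Sunday (hypothetical gate, not from the PR)."""
    return now.weekday() == SUNDAY


merge_day = datetime(2026, 4, 7, tzinfo=timezone.utc)  # the PR's merge date, a Tuesday
following_sunday = datetime(2026, 4, 12, tzinfo=timezone.utc)
```

A scheduler-level cron expression would achieve the same effect without a runtime check; which of the two the PR uses is not visible in this conversation.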

shruthilayaj and others added 2 commits April 2, 2026 14:42
Add test coverage for the new index_repos task including early return
conditions, correct payload construction, and repo deduplication across
projects. Also fix broken import path and add missing response status
check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the Scope: Backend label Apr 2, 2026
Comment thread src/sentry/tasks/seer/context_engine_index.py Outdated

github-actions bot commented Apr 2, 2026

Backend Test Failures

Failures on 82293b8 in this run:

tests/sentry/autofix/test_utils.py::TestGetRepoFromCodeMappings::test_get_repos_from_project_code_mappings_with_data
[gw1] linux -- Python 3.13.1 /home/runner/work/sentry/sentry/.venv/bin/python3
tests/sentry/autofix/test_utils.py:49: in test_get_repos_from_project_code_mappings_with_data
    assert repos == expected_repos
E   AssertionError: assert [{'external_i...sentry', ...}] == [{'external_i...2577440, ...}]
E     
E     At index 0 diff: {'repository_id': 6, 'organization_id': 4557900082577440, 'integration_id': '234', 'provider': 'github', 'owner': 'getsentry', 'name': 'sentry', 'external_id': '123', 'languages': []} != {'repository_id': 6, 'integration_id': '234', 'organization_id': 4557900082577440, 'provider': 'github', 'owner': 'getsentry', 'name': 'sentry', 'external_id': '123'}
E     
E     Full diff:
E       [
E           {
E               'external_id': '123',
E               'integration_id': '234',
E     +         'languages': [],
E               'name': 'sentry',
E               'organization_id': 4557900082577440,
E               'owner': 'getsentry',
E               'provider': 'github',
E               'repository_id': 6,
E           },
E       ]

@shruthilayaj shruthilayaj marked this pull request as ready for review April 2, 2026 20:04
@shruthilayaj shruthilayaj requested a review from a team as a code owner April 2, 2026 20:04
Comment thread src/sentry/tasks/seer/context_engine_index.py Outdated
Comment thread src/sentry/seer/signed_seer_api.py
Comment thread tests/sentry/tasks/seer/test_context_engine_index.py
Comment thread src/sentry/tasks/seer/context_engine_index.py Outdated
Comment thread src/sentry/tasks/seer/context_engine_index.py

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: NoneType AttributeError when project has no preferences
    • Updated the preferences lookup to default to an empty dict so projects without Seer preferences no longer raise an AttributeError.

Preview (11d75ba62c)
diff --git a/src/sentry/tasks/seer/context_engine_index.py b/src/sentry/tasks/seer/context_engine_index.py
--- a/src/sentry/tasks/seer/context_engine_index.py
+++ b/src/sentry/tasks/seer/context_engine_index.py
@@ -259,7 +259,7 @@
     preferences_by_id = bulk_get_project_preferences(organization_id, list(project_map.keys()))
 
     for project_id, project in project_map.items():
-        existing_pref = preferences_by_id.get(str(project_id))
+        existing_pref = preferences_by_id.get(str(project_id), {})
         project_pref_repos = existing_pref.get("repositories") or []
         autofix_repos = get_autofix_repos_from_project_code_mappings(project_map[project_id])


for project_id, project in project_map.items():
    existing_pref = preferences_by_id.get(str(project_id))
    project_pref_repos = existing_pref.get("repositories") or []
Contributor

NoneType AttributeError when project has no preferences

High Severity

preferences_by_id.get(str(project_id)) returns None when a project has no Seer preferences, then existing_pref.get("repositories") raises AttributeError: 'NoneType' object has no attribute 'get'. The bulk_get_project_preferences function returns a sparse dict — only projects with configured preferences appear as keys. Using .get(str(project_id), {}) as the default would prevent the crash.

Reviewed by Cursor Bugbot for commit 3243fdb. Configure here.

Contributor

We should skip projects that don't have preferences set up. If a project does not have preferences, customers basically can't use Seer for that project.
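
A minimal sketch of the skip-instead-of-default approach suggested in this reply, assuming `preferences_by_id` is a sparse dict keyed by stringified project ID, as the Bugbot comment describes. The helper name `collect_pref_repos` and the sample data are hypothetical and are not code from this PR.

```python
from typing import Any


def collect_pref_repos(
    project_ids: list[int],
    preferences_by_id: dict[str, dict[str, Any]],
) -> dict[int, list[dict[str, Any]]]:
    """Return repos per project, skipping projects with no Seer preferences."""
    repos_by_project: dict[int, list[dict[str, Any]]] = {}
    for project_id in project_ids:
        existing_pref = preferences_by_id.get(str(project_id))
        if existing_pref is None:
            # No preferences configured: skip rather than default to {}.
            continue
        repos_by_project[project_id] = existing_pref.get("repositories") or []
    return repos_by_project


prefs = {"1": {"repositories": [{"name": "sentry"}]}, "3": {"repositories": None}}
result = collect_pref_repos([1, 2, 3], prefs)
```

Here project 2 is absent from the result, while project 3 (preferences present, but `repositories` set to `None`) is kept with an empty list. Both options avoid the `AttributeError`; skipping additionally keeps unconfigured projects out of the payload.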

github-actions bot commented Apr 7, 2026

Backend Test Failures

Failures on 955b73a in this run:

tests/sentry/tasks/seer/test_context_engine_index.py::TestIndexRepos::test_deduplicates_repos_across_projects
[gw0] linux -- Python 3.13.1 /home/runner/work/sentry/sentry/.venv/bin/python3
.venv/lib/python3.13/site-packages/urllib3/connection.py:204: in _new_conn
    sock = connection.create_connection(
.venv/lib/python3.13/site-packages/urllib3/util/connection.py:85: in create_connection
    raise err
.venv/lib/python3.13/site-packages/urllib3/util/connection.py:73: in create_connection
    sock.connect(sa)
E   ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:
.venv/lib/python3.13/site-packages/urllib3/connectionpool.py:787: in urlopen
    response = self._make_request(
.venv/lib/python3.13/site-packages/urllib3/connectionpool.py:493: in _make_request
    conn.request(
.venv/lib/python3.13/site-packages/urllib3/connection.py:500: in request
    self.endheaders()
/opt/hostedtoolcache/Python/3.13.1/x64/lib/python3.13/http/client.py:1331: in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
/opt/hostedtoolcache/Python/3.13.1/x64/lib/python3.13/http/client.py:1091: in _send_output
    self.send(msg)
/opt/hostedtoolcache/Python/3.13.1/x64/lib/python3.13/http/client.py:1035: in send
    self.connect()
.venv/lib/python3.13/site-packages/urllib3/connection.py:331: in connect
    self.sock = self._new_conn()
.venv/lib/python3.13/site-packages/urllib3/connection.py:219: in _new_conn
    raise NewConnectionError(
E   urllib3.exceptions.NewConnectionError: HTTPConnection(host='127.0.0.1', port=9091): Failed to establish a new connection: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:
tests/sentry/tasks/seer/test_context_engine_index.py:320: in test_deduplicates_repos_across_projects
    index_repos(self.org.id)
.venv/lib/python3.13/site-packages/taskbroker_client/task.py:92: in __call__
    return self._func(*args, **kwargs)
src/sentry/tasks/seer/context_engine_index.py:259: in index_repos
    preferences_by_id = bulk_get_project_preferences(organization_id, list(project_map.keys()))
src/sentry/seer/autofix/utils.py:735: in bulk_get_project_preferences
    response = make_bulk_get_project_preferences_request(
src/sentry/seer/autofix/utils.py:258: in make_bulk_get_project_preferences_request
    return make_signed_seer_api_request(
.venv/lib/python3.13/site-packages/sentry_sdk/tracing_utils.py:916: in sync_wrapper
    result = f(*args, **kwargs)
src/sentry/seer/signed_seer_api.py:164: in make_signed_seer_api_request
    return connection_pool.urlopen(
.venv/lib/python3.13/site-packages/urllib3/connectionpool.py:871: in urlopen
    return self.urlopen(
.venv/lib/python3.13/site-packages/urllib3/connectionpool.py:871: in urlopen
    return self.urlopen(
.venv/lib/python3.13/site-packages/urllib3/connectionpool.py:871: in urlopen
    return self.urlopen(
.venv/lib/python3.13/site-packages/urllib3/connectionpool.py:841: in urlopen
... (4 more lines)
tests/sentry/tasks/seer/test_context_engine_index.py::TestIndexRepos::test_calls_seer_with_correct_org_and_repos
[gw0] linux -- Python 3.13.1 /home/runner/work/sentry/sentry/.venv/bin/python3
.venv/lib/python3.13/site-packages/urllib3/connection.py:204: in _new_conn
    sock = connection.create_connection(
.venv/lib/python3.13/site-packages/urllib3/util/connection.py:85: in create_connection
    raise err
.venv/lib/python3.13/site-packages/urllib3/util/connection.py:73: in create_connection
    sock.connect(sa)
E   ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:
.venv/lib/python3.13/site-packages/urllib3/connectionpool.py:787: in urlopen
    response = self._make_request(
.venv/lib/python3.13/site-packages/urllib3/connectionpool.py:493: in _make_request
    conn.request(
.venv/lib/python3.13/site-packages/urllib3/connection.py:500: in request
    self.endheaders()
/opt/hostedtoolcache/Python/3.13.1/x64/lib/python3.13/http/client.py:1331: in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
/opt/hostedtoolcache/Python/3.13.1/x64/lib/python3.13/http/client.py:1091: in _send_output
    self.send(msg)
/opt/hostedtoolcache/Python/3.13.1/x64/lib/python3.13/http/client.py:1035: in send
    self.connect()
.venv/lib/python3.13/site-packages/urllib3/connection.py:331: in connect
    self.sock = self._new_conn()
.venv/lib/python3.13/site-packages/urllib3/connection.py:219: in _new_conn
    raise NewConnectionError(
E   urllib3.exceptions.NewConnectionError: HTTPConnection(host='127.0.0.1', port=9091): Failed to establish a new connection: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:
tests/sentry/tasks/seer/test_context_engine_index.py:281: in test_calls_seer_with_correct_org_and_repos
    index_repos(self.org.id)
.venv/lib/python3.13/site-packages/taskbroker_client/task.py:92: in __call__
    return self._func(*args, **kwargs)
src/sentry/tasks/seer/context_engine_index.py:259: in index_repos
    preferences_by_id = bulk_get_project_preferences(organization_id, list(project_map.keys()))
src/sentry/seer/autofix/utils.py:735: in bulk_get_project_preferences
    response = make_bulk_get_project_preferences_request(
src/sentry/seer/autofix/utils.py:258: in make_bulk_get_project_preferences_request
    return make_signed_seer_api_request(
.venv/lib/python3.13/site-packages/sentry_sdk/tracing_utils.py:916: in sync_wrapper
    result = f(*args, **kwargs)
src/sentry/seer/signed_seer_api.py:164: in make_signed_seer_api_request
    return connection_pool.urlopen(
.venv/lib/python3.13/site-packages/urllib3/connectionpool.py:871: in urlopen
    return self.urlopen(
.venv/lib/python3.13/site-packages/urllib3/connectionpool.py:871: in urlopen
    return self.urlopen(
.venv/lib/python3.13/site-packages/urllib3/connectionpool.py:871: in urlopen
    return self.urlopen(
.venv/lib/python3.13/site-packages/urllib3/connectionpool.py:841: in urlopen
... (4 more lines)
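
Both tracebacks show `index_repos` reaching `bulk_get_project_preferences`, which attempts a real HTTP connection to 127.0.0.1:9091 and is refused. The following is a sketch of stubbing that call with `unittest.mock`, under the assumption that the tests should not reach Seer at all. Whether the PR resolved the failures this way is not shown here; `seer_utils`, `_real_bulk_get`, and `index_repos_sketch` are hypothetical stand-ins.

```python
from types import SimpleNamespace
from typing import Any
from unittest.mock import patch


def _real_bulk_get(org_id: int, project_ids: list[int]) -> dict[str, Any]:
    # Stand-in for the real helper, which makes an HTTP call to Seer.
    raise ConnectionRefusedError("no Seer service in the test environment")


# Hypothetical module-like holder so the lookup can be patched.
seer_utils = SimpleNamespace(bulk_get_project_preferences=_real_bulk_get)


def index_repos_sketch(org_id: int, project_ids: list[int]) -> list[str]:
    prefs = seer_utils.bulk_get_project_preferences(org_id, project_ids)
    return sorted(prefs)


fake_prefs = {"1": {"repositories": []}}
with patch.object(
    seer_utils, "bulk_get_project_preferences", return_value=fake_prefs
) as mock_get:
    patched_result = index_repos_sketch(42, [1])

mock_get.assert_called_once_with(42, [1])
```

In a real test module the patch target would be the name as imported by the module under test, not the module where the helper is defined.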

Comment thread src/sentry/tasks/seer/context_engine_index.py Outdated
Comment thread src/sentry/tasks/seer/context_engine_index.py
Comment thread src/sentry/tasks/seer/context_engine_index.py
Comment thread src/sentry/tasks/seer/context_engine_index.py
key = (repo["provider"], repo["owner"], repo["name"])
if key in org_repo_definitions:
    repo_definition = org_repo_definitions[key]
    repo_definition["project_ids"].append(project_id)
Contributor

Repo languages lost during cross-project deduplication

Low Severity

When a repo already exists in org_repo_definitions, only project_ids is appended — the languages field is never backfilled. If the first project to register a repo uses seer preferences (where the repo isn't in that project's autofix code mappings), languages is set to [] via language_map.get(key, []). When a later project encounters the same repo from its autofix repos (which do have language data), the existing entry's empty languages is never updated, permanently losing that information.

Reviewed by Cursor Bugbot for commit e1ab127. Configure here.
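
One possible shape for the backfill described in this comment, as a sketch only. It assumes each entry in `org_repo_definitions` carries `project_ids` and `languages` fields as the comment implies; the helper `merge_repo` is hypothetical.

```python
from typing import Any

RepoKey = tuple[str, str, str]


def merge_repo(
    org_repo_definitions: dict[RepoKey, dict[str, Any]],
    repo: dict[str, Any],
    project_id: int,
    languages: list[str],
) -> None:
    """Register a repo for a project, backfilling languages on later sightings."""
    key = (repo["provider"], repo["owner"], repo["name"])
    existing = org_repo_definitions.get(key)
    if existing is None:
        org_repo_definitions[key] = {
            **repo,
            "project_ids": [project_id],
            "languages": list(languages),
        }
        return
    existing["project_ids"].append(project_id)
    # Backfill only when the first sighting recorded no languages.
    if not existing["languages"] and languages:
        existing["languages"] = list(languages)


definitions: dict[RepoKey, dict[str, Any]] = {}
repo = {"provider": "github", "owner": "getsentry", "name": "sentry"}
merge_repo(definitions, repo, 1, [])          # first seen via preferences, no languages
merge_repo(definitions, repo, 2, ["python"])  # later seen via code mappings
entry = definitions[("github", "getsentry", "sentry")]
```

After the second call the entry lists both project IDs and carries the languages from the later sighting.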

"provider": repo["provider"],
"owner": repo["owner"],
"name": repo["name"],
"external_id": repo["external_id"],
Contributor

Bug: The code unsafely accesses keys on a raw dictionary from an API response, which will raise a KeyError if the response is malformed or missing expected keys.
Severity: HIGH

Suggested Fix

Use the safe .get() method when accessing keys from the repo dictionary to prevent KeyError exceptions. For a more robust solution, validate the raw API response with a Pydantic model before processing the data to ensure the data structure is correct.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: src/sentry/tasks/seer/context_engine_index.py#L285

Potential issue: The function `index_repos` processes repository data fetched from the
Seer API via `bulk_get_project_preferences()`. The code directly accesses dictionary
keys like `repo["external_id"]`, `repo["provider"]`, `repo["owner"]`, and `repo["name"]`
without using safe access methods like `.get()`. The API response is not validated
against a schema. If the Seer API returns a malformed response object that is missing
one of these required keys, the operation will fail with a `KeyError`. This will cause
the `index_repos` background task to crash, preventing repository indexing for the
affected organization.
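
The comment suggests `.get()` or a Pydantic model. As a dependency-free sketch of the same idea, the check below filters out entries missing any of the four keys the comment names. The helper `valid_repos` is hypothetical, and whether malformed entries should be skipped or should fail the task is a design choice this PR conversation does not settle.

```python
from typing import Any

REQUIRED_REPO_KEYS = ("provider", "owner", "name", "external_id")


def valid_repos(raw_repos: list[Any]) -> list[dict[str, Any]]:
    """Keep only repo entries that are dicts carrying every required key."""
    valid = []
    for repo in raw_repos:
        if not isinstance(repo, dict):
            continue
        if any(repo.get(k) is None for k in REQUIRED_REPO_KEYS):
            continue
        valid.append({k: repo[k] for k in REQUIRED_REPO_KEYS})
    return valid


raw = [
    {"provider": "github", "owner": "getsentry", "name": "sentry", "external_id": "123"},
    {"provider": "github", "owner": "getsentry"},  # malformed: missing name and external_id
]
cleaned = valid_repos(raw)
```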

language_map: dict[tuple[str, str, str], list[str]] = {}
for autofix_repo in autofix_repos:
    key = (autofix_repo["provider"], autofix_repo["owner"], autofix_repo["name"])
    language_map[key] = autofix_repo["languages"]
Contributor

Bug: The index_repos task will crash with a KeyError if a repository configured via SEER_AUTOFIX_FORCE_USE_REPOS is missing the languages key.
Severity: MEDIUM

Suggested Fix

Use the .get() method with a default value when accessing the languages key to prevent a KeyError. Change autofix_repo["languages"] to autofix_repo.get("languages", []). This will provide a safe fallback to an empty list if the key is not present in the repository configuration.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: src/sentry/tasks/seer/context_engine_index.py#L271

Potential issue: When the `SEER_AUTOFIX_FORCE_USE_REPOS` setting is used, for example in
testing or staging environments, the `index_repos` task can fail. The code iterates
through the configured repositories and directly accesses the `languages` key from each
repository dictionary. However, unlike the standard code path, the logic for this
setting does not ensure the `languages` key is present. If a repository is configured
without this key, the task will raise a `KeyError` and fail, as this exception is not
configured for retries. This will halt the repository indexing process in environments
that use this override.
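
The suggested change, wrapped in a small function so it can be run on its own. The loop body mirrors the snippet quoted above with the `.get("languages", [])` fallback applied; the wrapper name `build_language_map` and the sample repos are assumptions for illustration.

```python
from typing import Any


def build_language_map(
    autofix_repos: list[dict[str, Any]],
) -> dict[tuple[str, str, str], list[str]]:
    language_map: dict[tuple[str, str, str], list[str]] = {}
    for autofix_repo in autofix_repos:
        key = (autofix_repo["provider"], autofix_repo["owner"], autofix_repo["name"])
        # Fall back to an empty list when "languages" is absent.
        language_map[key] = autofix_repo.get("languages", [])
    return language_map


language_map = build_language_map([
    {"provider": "github", "owner": "getsentry", "name": "sentry", "languages": ["python"]},
    {"provider": "github", "owner": "getsentry", "name": "seer"},  # no "languages" key
])
```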


Comment thread src/sentry/tasks/seer/context_engine_index.py Outdated

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Reviewed by Cursor Bugbot for commit cf5c5a4. Configure here.

)

if response.status >= 400:
    raise SeerApiError("Seer request failed", response.status)
Contributor

Missing early return when no repos are collected

Low Severity

The index_repos function makes a Seer API call even when org_repo_definitions is empty (e.g., when all projects lack preferences or have empty/None repository lists). Other similar tasks like build_service_map and index_org_project_knowledge include early returns for analogous "no data" scenarios (no nodes, no high-volume projects). Adding an early return when org_repo_definitions is empty would avoid unnecessary API calls, which can add up since this runs across many orgs.

Reviewed by Cursor Bugbot for commit cf5c5a4. Configure here.
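
A sketch of the early return this comment proposes. The payload shape and the injected `post` callable are assumptions made so the example runs without a Seer service; the real task's request code is not shown in this conversation.

```python
from typing import Any, Callable


def send_index_request(
    org_id: int,
    org_repo_definitions: dict[tuple[str, str, str], dict[str, Any]],
    post: Callable[[dict[str, Any]], None],
) -> bool:
    """Call `post` only when there is something to index; report whether it ran."""
    if not org_repo_definitions:
        # Nothing collected for this org: skip the API call entirely.
        return False
    post({"org_id": org_id, "repos": list(org_repo_definitions.values())})
    return True


calls: list[dict[str, Any]] = []
sent_empty = send_index_request(1, {}, calls.append)
sent_full = send_index_request(
    1, {("github", "getsentry", "sentry"): {"name": "sentry"}}, calls.append
)
```

Only the second call reaches `post`, so `calls` ends up with a single payload.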

@shruthilayaj shruthilayaj merged commit d5c1cfe into master Apr 7, 2026
80 checks passed
@shruthilayaj shruthilayaj deleted the shruthi/feat/add-repo-indexing-job branch April 7, 2026 18:33
george-sentry pushed a commit that referenced this pull request Apr 9, 2026